This R Notebook presents an in-depth analysis and solution for the fraud detection challenge in e-commerce transactions, as part of the partnership between IEEE Computational Intelligence Society (IEEE-CIS) and Vesta Corporation.
This competition is a binary classification problem - i.e. our target variable is a binary attribute (is the transaction fraudulent or not?) and our goal is to classify transactions as "fraudulent" or "not fraudulent" as accurately as possible.
The notebook explores a large-scale dataset provided by Vesta Corporation, containing diverse features from device types to product details.
Using R, this notebook guides users through the entire data science pipeline, including data preprocessing, exploratory data analysis, feature engineering, model selection, and evaluation.
The notebook provides clear and concise code examples, accompanied by detailed explanations and insights into the dataset.
Kaggle Link: https://www.kaggle.com/c/ieee-fraud-detection
Dataset Link: https://drive.google.com/file/d/1ZqlRrTUZNao-I1lZNVztf206vp8QUkqV/view
Github Link: https://github.com/Sugumaran-Balasubramaniyan/IEEE-CIS-Fraud-Detection-using-R/tree/main
isFraud: This column indicates whether a transaction is fraudulent or not. It is the target variable, with a binary value of 1 for fraud and 0 for legitimate transactions.
TransactionDT: timedelta from a given reference datetime (not an actual timestamp)
TransactionAmt: This column contains the transaction amount or value in USD
ProductCD: This column represents the product code or category associated with the transaction.
card1 - card6: payment card information, such as card type, card category, issuing bank, country, etc.
addr1 and addr2: These columns represent address-related information, such as billing or shipping address.
dist1 and dist2: These columns indicate the distance between the transaction location and the address provided.
P_emaildomain and R_emaildomain: These columns contain email domain information for the purchaser (P) and recipient (R) of the transaction, respectively.
C1-C14: These columns represent numerical categorical features associated with the transaction, possibly derived from counting occurrences or frequencies, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
D1-D15: These columns contain time-related features that may represent the number of days elapsed since a specific event or transaction (timedelta, such as days between previous transaction, etc.)
M1-M9: These columns represent binary categorical features related to match status, indicating whether personal information associated with the transaction matches or not (e.g., names on card and address).
V1-V339: Vesta engineered rich features, including ranking, counting, and other entity relations.
# install.packages("tidyverse")
# install.packages("keras")
# Loading the dataset from a specific file path
load("/Users/sugumaran/Documents/EM-LYON/Financial and Data Analysis with R/Final Project/transactions.rdata")
# To display the first few rows of a dataset
head(transactions)
# Viewing the column names available in the dataset
colnames(transactions)
[1] "isFraud" "TransactionDT" "TransactionAmt" "ProductCD" "card1" "card2"
[7] "card3" "card4" "card5" "card6" "addr1" "addr2"
[13] "dist1" "dist2" "P_emaildomain" "R_emaildomain" "C1" "C2"
[19] "C3" "C4" "C5" "C6" "C7" "C8"
[25] "C9" "C10" "C11" "C12" "C13" "C14"
[31] "D1" "D2" "D3" "D4" "D5" "D6"
[37] "D7" "D8" "D9" "D10" "D11" "D12"
[43] "D13" "D14" "D15" "M1" "M2" "M3"
[49] "M4" "M5" "M6" "M7" "M8" "M9"
[55] "V1" "V2" "V3" "V4" "V5" "V6"
[61] "V7" "V8" "V9" "V10" "V11" "V12"
[67] "V13" "V14" "V15" "V16" "V17" "V18"
[73] "V19" "V20" "V21" "V22" "V23" "V24"
[79] "V25" "V26" "V27" "V28" "V29" "V30"
[85] "V31" "V32" "V33" "V34" "V35" "V36"
[91] "V37" "V38" "V39" "V40" "V41" "V42"
[97] "V43" "V44" "V45" "V46" "V47" "V48"
[103] "V49" "V50" "V51" "V52" "V53" "V54"
[109] "V55" "V56" "V57" "V58" "V59" "V60"
[115] "V61" "V62" "V63" "V64" "V65" "V66"
[121] "V67" "V68" "V69" "V70" "V71" "V72"
[127] "V73" "V74" "V75" "V76" "V77" "V78"
[133] "V79" "V80" "V81" "V82" "V83" "V84"
[139] "V85" "V86" "V87" "V88" "V89" "V90"
[145] "V91" "V92" "V93" "V94" "V95" "V96"
[151] "V97" "V98" "V99" "V100" "V101" "V102"
[157] "V103" "V104" "V105" "V106" "V107" "V108"
[163] "V109" "V110" "V111" "V112" "V113" "V114"
[169] "V115" "V116" "V117" "V118" "V119" "V120"
[175] "V121" "V122" "V123" "V124" "V125" "V126"
[181] "V127" "V128" "V129" "V130" "V131" "V132"
[187] "V133" "V134" "V135" "V136" "V137" "V138"
[193] "V139" "V140" "V141" "V142" "V143" "V144"
[199] "V145" "V146" "V147" "V148" "V149" "V150"
[205] "V151" "V152" "V153" "V154" "V155" "V156"
[211] "V157" "V158" "V159" "V160" "V161" "V162"
[217] "V163" "V164" "V165" "V166" "V167" "V168"
[223] "V169" "V170" "V171" "V172" "V173" "V174"
[229] "V175" "V176" "V177" "V178" "V179" "V180"
[235] "V181" "V182" "V183" "V184" "V185" "V186"
[241] "V187" "V188" "V189" "V190" "V191" "V192"
[247] "V193" "V194" "V195" "V196" "V197" "V198"
[253] "V199" "V200" "V201" "V202" "V203" "V204"
[259] "V205" "V206" "V207" "V208" "V209" "V210"
[265] "V211" "V212" "V213" "V214" "V215" "V216"
[271] "V217" "V218" "V219" "V220" "V221" "V222"
[277] "V223" "V224" "V225" "V226" "V227" "V228"
[283] "V229" "V230" "V231" "V232" "V233" "V234"
[289] "V235" "V236" "V237" "V238" "V239" "V240"
[295] "V241" "V242" "V243" "V244" "V245" "V246"
[301] "V247" "V248" "V249" "V250" "V251" "V252"
[307] "V253" "V254" "V255" "V256" "V257" "V258"
[313] "V259" "V260" "V261" "V262" "V263" "V264"
[319] "V265" "V266" "V267" "V268" "V269" "V270"
[325] "V271" "V272" "V273" "V274" "V275" "V276"
[331] "V277" "V278" "V279" "V280" "V281" "V282"
[337] "V283" "V284" "V285" "V286" "V287" "V288"
[343] "V289" "V290" "V291" "V292" "V293" "V294"
[349] "V295" "V296" "V297" "V298" "V299" "V300"
[355] "V301" "V302" "V303" "V304" "V305" "V306"
[361] "V307" "V308" "V309" "V310" "V311" "V312"
[367] "V313" "V314" "V315" "V316" "V317" "V318"
[373] "V319" "V320" "V321" "V322" "V323" "V324"
[379] "V325" "V326" "V327" "V328" "V329" "V330"
[385] "V331" "V332" "V333" "V334" "V335" "V336"
[391] "V337" "V338" "V339"
# Initial exploration
# To display the structure of the dataset and information about its type and contents.
str(transactions)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 590540 obs. of 393 variables:
$ isFraud : num 0 0 0 0 0 0 0 0 0 0 ...
$ TransactionDT : num 86400 86401 86469 86499 86506 ...
$ TransactionAmt: num 68.5 29 59 50 50 ...
$ ProductCD : chr "W" "W" "W" "W" ...
$ card1 : num 13926 2755 4663 18132 4497 ...
$ card2 : num NA 404 490 567 514 555 360 490 100 111 ...
$ card3 : num 150 150 150 150 150 150 150 150 150 150 ...
$ card4 : chr "discover" "mastercard" "visa" "mastercard" ...
$ card5 : num 142 102 166 117 102 226 166 226 226 224 ...
$ card6 : chr "credit" "credit" "debit" "debit" ...
$ addr1 : num 315 325 330 476 420 272 126 325 337 204 ...
$ addr2 : num 87 87 87 87 87 87 87 87 87 87 ...
$ dist1 : num 19 NA 287 NA NA 36 0 NA NA 19 ...
$ dist2 : num NA NA NA NA NA NA NA NA NA NA ...
$ P_emaildomain : chr NA "gmail.com" "outlook.com" "yahoo.com" ...
$ R_emaildomain : chr NA NA NA NA ...
$ C1 : num 1 1 1 2 1 1 1 1 1 2 ...
$ C2 : num 1 1 1 5 1 1 1 1 1 2 ...
$ C3 : num 0 0 0 0 0 0 0 0 0 0 ...
$ C4 : num 0 0 0 0 0 0 0 0 0 0 ...
$ C5 : num 0 0 0 0 0 0 0 0 0 0 ...
$ C6 : num 1 1 1 4 1 1 1 1 1 3 ...
$ C7 : num 0 0 0 0 0 0 0 0 0 0 ...
$ C8 : num 0 0 0 0 1 0 0 0 1 0 ...
$ C9 : num 1 0 1 1 0 1 1 0 0 3 ...
$ C10 : num 0 0 0 0 1 0 0 0 1 0 ...
$ C11 : num 2 1 1 1 1 1 1 1 1 1 ...
$ C12 : num 0 0 0 0 0 0 0 0 0 0 ...
$ C13 : num 1 1 1 25 1 1 1 1 1 12 ...
$ C14 : num 1 1 1 1 1 1 1 1 1 2 ...
$ D1 : num 14 0 0 112 0 0 0 0 0 61 ...
$ D2 : num NA NA NA 112 NA NA NA NA NA 61 ...
$ D3 : num 13 NA NA 0 NA NA NA NA NA 30 ...
$ D4 : num NA 0 0 94 NA 0 0 0 NA 318 ...
$ D5 : num NA NA NA 0 NA NA NA NA NA 30 ...
$ D6 : num NA NA NA NA NA NA NA NA NA NA ...
$ D7 : num NA NA NA NA NA NA NA NA NA NA ...
$ D8 : num NA NA NA NA NA NA NA NA NA NA ...
$ D9 : num NA NA NA NA NA NA NA NA NA NA ...
$ D10 : num 13 0 0 84 NA 0 0 0 NA 40 ...
$ D11 : num 13 NA 315 NA NA 0 0 NA NA 302 ...
$ D12 : num NA NA NA NA NA NA NA NA NA NA ...
$ D13 : num NA NA NA NA NA NA NA NA NA NA ...
$ D14 : num NA NA NA NA NA NA NA NA NA NA ...
$ D15 : num 0 0 315 111 NA 0 0 0 NA 318 ...
$ M1 : logi TRUE NA TRUE NA NA TRUE ...
$ M2 : logi TRUE NA TRUE NA NA TRUE ...
$ M3 : logi TRUE NA TRUE NA NA TRUE ...
$ M4 : chr "M2" "M0" "M0" "M0" ...
$ M5 : logi FALSE TRUE FALSE TRUE NA FALSE ...
$ M6 : logi TRUE TRUE FALSE FALSE NA TRUE ...
$ M7 : logi NA NA FALSE NA NA NA ...
$ M8 : logi NA NA FALSE NA NA NA ...
$ M9 : logi NA NA FALSE NA NA NA ...
$ V1 : num 1 NA 1 NA NA 1 1 NA NA 1 ...
$ V2 : num 1 NA 1 NA NA 1 1 NA NA 1 ...
$ V3 : num 1 NA 1 NA NA 1 1 NA NA 1 ...
$ V4 : num 1 NA 1 NA NA 1 1 NA NA 1 ...
$ V5 : num 1 NA 1 NA NA 1 1 NA NA 1 ...
$ V6 : num 1 NA 1 NA NA 1 1 NA NA 1 ...
$ V7 : num 1 NA 1 NA NA 1 1 NA NA 1 ...
$ V8 : num 1 NA 1 NA NA 1 1 NA NA 1 ...
$ V9 : num 1 NA 1 NA NA 1 1 NA NA 1 ...
$ V10 : num 0 NA 0 NA NA 0 0 NA NA 0 ...
$ V11 : num 0 NA 0 NA NA 0 0 NA NA 0 ...
$ V12 : num 1 0 1 1 NA 1 1 0 NA 1 ...
$ V13 : num 1 0 1 1 NA 1 1 0 NA 1 ...
$ V14 : num 1 1 1 1 NA 1 1 1 NA 1 ...
$ V15 : num 0 0 0 0 NA 0 0 0 NA 0 ...
$ V16 : num 0 0 0 0 NA 0 0 0 NA 0 ...
$ V17 : num 0 0 0 0 NA 0 0 0 NA 0 ...
$ V18 : num 0 0 0 0 NA 0 0 0 NA 0 ...
$ V19 : num 1 1 1 1 NA 1 1 1 NA 1 ...
$ V20 : num 1 1 1 1 NA 1 1 1 NA 1 ...
$ V21 : num 0 0 0 0 NA 0 0 0 NA 0 ...
$ V22 : num 0 0 0 0 NA 0 0 0 NA 0 ...
$ V23 : num 1 1 1 1 NA 1 1 1 NA 1 ...
$ V24 : num 1 1 1 1 NA 1 1 1 NA 1 ...
$ V25 : num 1 1 1 1 NA 1 1 1 NA 1 ...
$ V26 : num 1 1 1 1 NA 1 1 1 NA 1 ...
$ V27 : num 0 0 0 0 NA 0 0 0 NA 0 ...
$ V28 : num 0 0 0 0 NA 0 0 0 NA 0 ...
$ V29 : num 0 0 0 0 NA 0 0 0 NA 0 ...
$ V30 : num 0 0 0 0 NA 0 0 0 NA 0 ...
$ V31 : num 0 0 0 0 NA 0 0 0 NA 0 ...
$ V32 : num 0 0 0 0 NA 0 0 0 NA 0 ...
$ V33 : num 0 0 0 0 NA 0 0 0 NA 0 ...
$ V34 : num 0 0 0 0 NA 0 0 0 NA 0 ...
$ V35 : num NA 0 1 1 NA 1 1 0 NA 1 ...
$ V36 : num NA 0 1 1 NA 1 1 0 NA 1 ...
$ V37 : num NA 1 1 1 NA 1 1 1 NA 1 ...
$ V38 : num NA 1 1 1 NA 1 1 1 NA 1 ...
$ V39 : num NA 0 0 0 NA 0 0 0 NA 0 ...
$ V40 : num NA 0 0 0 NA 0 0 0 NA 0 ...
$ V41 : num NA 1 1 1 NA 1 1 1 NA 1 ...
$ V42 : num NA 0 0 0 NA 0 0 0 NA 0 ...
$ V43 : num NA 0 0 0 NA 0 0 0 NA 0 ...
$ V44 : num NA 1 1 1 NA 1 1 1 NA 1 ...
$ V45 : num NA 1 1 1 NA 1 1 1 NA 1 ...
[list output truncated]
# Loading required library
library(data.table)
# Function to find the missing values in a dataset
find_missing_values <- function(data) {
# To Convert data frame to data table
dt <- as.data.table(data)
# To Calculate count and percentage of missing values for each variable
missing_counts <- dt[, lapply(.SD, function(x) sum(is.na(x))), .SDcols = names(dt)]
missing_percentages <- dt[, lapply(.SD, function(x) sum(is.na(x)) / .N * 100), .SDcols = names(dt)]
# To Combine the results into a data table
missing_data <- data.table(
Variable = names(dt),
Missing_Count = unlist(missing_counts),
Missing_Percentage = unlist(missing_percentages)
)
# To Return the resulting data table
return(missing_data)
}
# Calling the function to find missing values on 'transactions' dataset
missing_values_table <- find_missing_values(transactions)
# Print the resulting data table
print(missing_values_table)
# Function to remove the columns with more than 40% of missing values in a given dataset
remove_columns_with_missing <- function(data, threshold = 0.4) {
# To Calculate the number of missing values in each column
missing_counts <- colSums(is.na(data))
# To Calculate the percentage of missing values in each column
missing_percentages <- missing_counts / nrow(data)
# To Identify columns with missing percentages above the threshold
columns_to_remove <- names(missing_percentages[missing_percentages > threshold])
# To Remove columns with missing percentages above the threshold
cleaned_data <- data[, !names(data) %in% columns_to_remove]
# To Return the cleaned dataset
return(cleaned_data)
}
# Calling the function to remove columns that has more than 40% of missing values on 'transactions' dataset
cleaned_data <- remove_columns_with_missing(transactions)
# Viewing the dataset after removing columns
head(cleaned_data)
# Calling the function to find missing values on 'cleaned_data' dataset
missing_values_table_clean_data <- find_missing_values(cleaned_data)
# Printing the resulting data table
print(missing_values_table_clean_data)
# Get the number of rows and columns in the cleaned dataset
num_rows <- dim(cleaned_data)[1]
num_columns <- dim(cleaned_data)[2]
# Count how many columns were dropped
num_removed <- ncol(transactions) - num_columns
# Print the results
cat("There were", num_removed, "columns with more than 40% of missing values in the dataset and they were removed.\n")
There were 192 columns with more than 40% of missing values in the dataset and they were removed.
cat("The 'cleaned_data' dataset contains", num_rows, "rows and", num_columns, "columns.\n")
The 'cleaned_data' dataset contains 590540 rows and 201 columns.
# Loading the required library
library(ggplot2)
# Calculating the count of each category
count_data <- data.frame(table(cleaned_data$isFraud))
# Plot the bar plot with count labels
ggplot(count_data, aes(x = factor(Var1), y = Freq)) +
geom_bar(stat = "identity", fill = "steelblue") +
geom_text(aes(label = Freq), vjust = -0.5, size = 3, color = "black") +
labs(x = "isFraud", y = "Count") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Fraudulent Transactions")
Notice how imbalanced the dataset is!
Imbalance means that the number of data points available for the different classes differs greatly.
Most of the transactions are non-fraud. If we use this dataset as-is for our predictive models and analysis, the algorithms will tend to overfit to the majority class, since they will "assume" that most transactions are not fraud.
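To quantify the imbalance rather than eyeballing the bar plot, here is a minimal sketch (assuming `cleaned_data` is in memory as above):

```r
# Share of each class in percent; prop.table() normalizes the raw counts
class_share <- prop.table(table(cleaned_data$isFraud)) * 100
round(class_share, 2)
```

With only a few percent of positives, accuracy alone is a poor metric; class weights, resampling, or rank-based metrics such as AUC are the usual remedies.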
# Convert TransactionDT to datetime format; the 1970 origin is arbitrary,
# since TransactionDT is a timedelta from an unknown reference, not a real timestamp
cleaned_data$TransactionDT <- as.POSIXct(cleaned_data$TransactionDT, origin = "1970-01-01", tz = "UTC")
# Find the maximum and minimum TransactionDT values
max_transactionDT <- max(cleaned_data$TransactionDT)
min_transactionDT <- min(cleaned_data$TransactionDT)
# Calculate the difference between min and max TransactionDT values in months
dt_difference <- round(as.numeric(as.Date(max_transactionDT) - as.Date(min_transactionDT)) / 30.436875)
# Display the maximum, minimum, and difference in TransactionDT values
print(max_transactionDT)
[1] "1970-07-02 23:58:51 UTC"
print(min_transactionDT)
[1] "1970-01-02 UTC"
cat(dt_difference, "months")
6 months
The time difference between the maximum and minimum values of the “TransactionDT” column is approximately 6 months.
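Since `TransactionDT` is a timedelta in seconds rather than a true timestamp, the six-month span can also be checked directly on the raw values; a sketch (the two numeric values below are the min/max implied by the dates printed above):

```r
# TransactionDT is elapsed seconds from an unknown reference datetime
seconds_per_day <- 24 * 60 * 60             # 86400
min_dt <- 86400                             # first transaction: day 1
max_dt <- 15811131                          # last transaction
span_days <- (max_dt - min_dt) / seconds_per_day
round(span_days / 30.436875, 1)             # span in average-length months, ~6
```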
head(cleaned_data)
library(ggplot2)
library(plotly)
# Convert TransactionDT to POSIXct format
cleaned_data$TransactionDT <- as.POSIXct(cleaned_data$TransactionDT, origin = "1970-01-01", tz = "UTC")
# Create a histogram plot
histogram <- ggplot(cleaned_data, aes(x = TransactionDT)) +
geom_histogram(fill = "steelblue", color = "white") +
labs(x = "TransactionDT", y = "Count") +
ggtitle("Histogram of TransactionDT") +
theme(plot.title = element_text(hjust = 0.5)) + # Center the title
scale_x_datetime(date_labels = "%b", date_breaks = "1 month")
# Convert the plot to an interactive version
interactive_histogram <- ggplotly(histogram)
# Display the interactive histogram
interactive_histogram
library(dplyr)
library(ggplot2)
library(plotly)
# Convert TransactionDT to month format
cleaned_data$TransactionMonth <- format(cleaned_data$TransactionDT, "%Y-%m")
# Count the transactions by month
transaction_counts <- cleaned_data %>%
group_by(TransactionMonth) %>%
summarise(Count = length(TransactionMonth))
# Find the month with the highest transaction count
highest_month <- transaction_counts %>%
filter(Count == max(Count)) %>%
pull(TransactionMonth)
# Plot the transaction counts by month
interactive_plot <- ggplot(transaction_counts, aes(x = TransactionMonth, y = Count)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(x = "Month", y = "Transaction Count") +
ggtitle("Transactions by Month") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(aes(label = Count), vjust = -0.5, size = 4) +
annotate("text", x = highest_month, y = transaction_counts$Count[transaction_counts$TransactionMonth == highest_month] + 50,
label = paste(transaction_counts$Count[transaction_counts$TransactionMonth == highest_month]), vjust = -1)
# Convert the plot to an interactive version using plotly
interactive_plot <- ggplotly(interactive_plot)
# Display the interactive plot
interactive_plot
We can see that January has the highest number of transactions (134339) and July the lowest (5493); this is because the dataset contains only two days of transactions for July.
head(cleaned_data)
library(ggplot2)
# Plot the distribution of TransactionAmt with smooth KDE
transaction_distribution <- ggplot(cleaned_data, aes(x = TransactionAmt)) +
geom_density(fill = "steelblue", color = "white") +
labs(x = "Transaction Amount", y = "Density") +
ggtitle("Distribution of Transaction Amount") +
theme(plot.title = element_text(hjust = 0.5))
# Display the plot
print(transaction_distribution)
library(ggplot2)
# Plot the distribution of log(TransactionAmt) with smooth KDE
transaction_distribution <- ggplot(cleaned_data, aes(x = log(TransactionAmt))) +
geom_density(fill = "steelblue", color = "white") +
labs(x = "Log(Transaction Amount)", y = "Density") +
ggtitle("Distribution of Log(Transaction Amount)") +
theme(plot.title = element_text(hjust = 0.5))
# Display the plot
print(transaction_distribution)
library(ggplot2)
# Plot the distribution of TransactionAmt based on isFraud
transaction_distribution <- ggplot(cleaned_data, aes(x = log(TransactionAmt), fill = as.factor(isFraud))) +
geom_density(alpha = 0.5) +
labs(x = "Transaction Amount", y = "Density", fill = "Is Fraud") +
ggtitle("Log Distribution of Transaction Amount by Fraud or Not Fraud") +
theme(plot.title = element_text(hjust = 0.5))
# Display the plot
print(transaction_distribution)
library(ggplot2)
# Plot the bar plot of ProductCD
product_barplot <- ggplot(cleaned_data, aes(x = ProductCD)) +
geom_bar(fill = "steelblue") +
labs(x = "Product Code", y = "Count") +
ggtitle("Bar Plot of ProductCD") +
theme(plot.title = element_text(hjust = 0.5))
# Display the plot
print(product_barplot)
The Product Code “W” occurs most frequently in the dataset.
library(ggplot2)
# Plot the bar plot of ProductCD based on isFraud
product_barplot <- ggplot(cleaned_data, aes(x = ProductCD, fill = factor(isFraud))) +
geom_bar() +
labs(x = "Product Code", y = "Count", fill = "isFraud") +
ggtitle("Bar Plot of ProductCD by isFraud") +
theme(plot.title = element_text(hjust = 0.5))
# Display the plot
print(product_barplot)
library(ggplot2)
# Plot the bar plot of card4 based on isFraud
card4_barplot <- ggplot(cleaned_data, aes(x = card4, fill = factor(isFraud))) +
geom_bar() +
labs(x = "Card Network", y = "Count", fill = "isFraud") +
ggtitle("Bar Plot of Card Network (card4) by isFraud") +
theme(plot.title = element_text(hjust = 0.5))
# Display the plot
print(card4_barplot)
The VISA network has the most transactions, and consequently also the highest absolute number of fraudulent cases.
library(ggplot2)
library(dplyr)
# Calculate the count of each combination of card6 and isFraud
card6_counts <- cleaned_data %>%
group_by(card6, isFraud) %>%
summarise(count = n()) %>%
ungroup()
# Plot the bar plot of card6 based on isFraud with count values
card6_barplot <- ggplot(card6_counts, aes(x = card6, y = count, fill = factor(isFraud))) +
geom_bar(stat = "identity") +
geom_text(aes(label = count), vjust = -0.5, color = "black", size = 3) +
labs(x = "Card Type", y = "Count", fill = "isFraud") +
ggtitle("Bar Plot of Card Type (card6) by isFraud") +
theme(plot.title = element_text(hjust = 0.5))
# Display the plot
print(card6_barplot)
Debit cards have the highest number of occurrences in the dataset and also the most fraudulent cases, with 10674 frauds, followed by credit cards with 9950 frauds out of 139036 occurrences.
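Raw counts alone can mislead here: debit leads in absolute frauds simply because it dominates the dataset. Using the credit-card figures quoted above (9950 frauds among 139036 occurrences), the fraud *rate* is easy to check:

```r
# Fraud rate for credit cards, from the counts quoted above
credit_fraud_rate <- 9950 / 139036 * 100
round(credit_fraud_rate, 2)   # about 7.16%
```

A per-type rate comparison (e.g. grouping by card6 and averaging isFraud) is generally more informative than stacked counts.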
# Find the top 10 email domains
top_10_domains <- cleaned_data %>%
filter(!is.na(P_emaildomain)) %>%
count(P_emaildomain) %>%
arrange(desc(n)) %>%
head(10)
# Calculate the total frequency
total_frequency <- sum(top_10_domains$n)
# Create a data frame with domain names and frequencies
top_10_domains_df <- data.frame(EmailDomain = top_10_domains$P_emaildomain, Frequency = top_10_domains$n)
# Add column to count number of frauds
top_10_domains_df$isFraud <- sapply(top_10_domains_df$EmailDomain, function(domain) {
sum(cleaned_data$P_emaildomain == domain & cleaned_data$isFraud == 1, na.rm = TRUE)
})
# Add percentage columns
top_10_domains_df$Frequency_Percentage <- (top_10_domains_df$Frequency / total_frequency) * 100
top_10_domains_df$isFraud_Percentage <- (top_10_domains_df$isFraud / top_10_domains_df$Frequency) * 100
# Display the top 10 email domains with frequencies, percentages, and number of frauds
print(top_10_domains_df)
# Sort the top 10 domains by isFraud_Percentage in descending order
top_10_domains_df <- top_10_domains_df[order(-top_10_domains_df$isFraud_Percentage), ]
# Display the sorted table
print(top_10_domains_df)
The above table displays the top 10 email domains based on their frequency in the dataset. Each row represents a specific email domain and provides information about the frequency, number of fraud cases (“isFraud”), percentage of occurrences, and the percentage of fraud cases (“isFraud_Percentage”) for that domain.
Here’s the explanation of each column:
EmailDomain: the name of the email domain.
Frequency: the total number of occurrences of that email domain in the dataset.
isFraud: the number of fraud cases associated with that email domain.
Frequency_Percentage: the percentage of occurrences of that email domain out of the total occurrences of the top 10 domains.
isFraud_Percentage: the percentage of fraud cases out of the total occurrences for that email domain.
For example, consider the row with EmailDomain "outlook.com": even though "gmail.com" has the highest frequency, its isFraud_Percentage is only 4.3541%.
Frequency: 5096 indicates that "outlook.com" appears 5096 times in the dataset.
isFraud: 482 means that out of those 5096 occurrences, 482 are classified as fraud.
Frequency_Percentage: 1.0907488% is the share of "outlook.com" in the total occurrences of the top-10 email domains.
isFraud_Percentage: 9.4583987% is the share of fraud cases among all occurrences of "outlook.com".
You can interpret the other rows in a similar manner, where the values for each domain provide insights into their frequency, fraud cases, and the corresponding percentages.
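As a quick sanity check on the arithmetic, the outlook.com fraud percentage quoted above can be reproduced directly:

```r
# 482 fraud cases out of 5096 outlook.com transactions
outlook_fraud_pct <- 482 / 5096 * 100
round(outlook_fraud_pct, 4)   # 9.4584, matching isFraud_Percentage above
```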
head(cleaned_data)
# Removing rows with missing values
df_clean <- na.omit(cleaned_data)
# Viewing the dataset after removing missing values
head(df_clean)
# Finding the missing values in the clean dataset
find_missing_values(df_clean)
# Loading the required library
library(ggplot2)
# Calculating the count of each category
count_data <- data.frame(table(df_clean$isFraud))
# Plot the bar plot with count labels
ggplot(count_data, aes(x = factor(Var1), y = Freq)) +
geom_bar(stat = "identity", fill = "steelblue") +
geom_text(aes(label = Freq), vjust = -0.5, size = 3, color = "black") +
labs(x = "isFraud", y = "Count") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Fraudulent Transactions")
head(df_clean)
# Replacing TRUE and FALSE with 1 and 0 in column M6
df_clean$M6 <- as.integer(df_clean$M6)
# Verify the updated values
head(df_clean$M6)
[1] 1 0 0 1 0 0
head(df_clean)
# Defining the list of columns to remove
columns_to_remove <- c("ProductCD","TransactionDT", "TransactionMonth")
# Removing the columns from the dataset
df_clean <- df_clean[, !(names(df_clean) %in% columns_to_remove)]
# Categorical variables
categorical_vars <- c( "card4", "card6", "P_emaildomain")
categorical_vars
[1] "card4" "card6" "P_emaildomain"
# Categorical variables
categorical_vars <- c("card4", "card6", "P_emaildomain")
# Converting categorical variables to factors
df_clean[categorical_vars] <- lapply(df_clean[categorical_vars], as.factor)
# Performing one-hot encoding
df_encoded <- model.matrix(~.-1, data = df_clean[, categorical_vars])
# Combining encoded variables with the original dataset
df_encoded <- cbind(df_clean, df_encoded)
# Removing the original categorical variables
df_encoded <- df_encoded[, !(names(df_encoded) %in% categorical_vars)]
# Printing the encoded dataset
print(df_encoded)
library(caret)
# Setting the seed for reproducibility
set.seed(123)
# Performing stratified sampling
sampled_indices <- createDataPartition(df_encoded$isFraud, p = 0.5, list = FALSE)
# Obtaining the stratified sample
stratified_sample <- df_encoded[sampled_indices, ]
head(stratified_sample)
# Install and load required packages
# install.packages("doParallel")
library(randomForest)
library(doParallel)
# Set the number of cores for parallel processing
num_cores <- parallel::detectCores() - 1 # Use all available cores except one
cl <- makeCluster(num_cores)
registerDoParallel(cl)
# Split the data into input features (x) and target variable (y)
x <- stratified_sample[, !names(stratified_sample) %in% "isFraud"] # Exclude the "isFraud" column
y <- stratified_sample$isFraud
# Fit the random forest model
# (note: with a numeric 0/1 target, randomForest fits a regression rather than
# a classification; the resulting importance scores are still usable for ranking features)
rf_model <- randomForest(x, y, ntree = 5, mtry = floor(sqrt(ncol(x))))
# Stop the parallel processing
stopCluster(cl)
registerDoSEQ()
# Extract feature importance
importance <- importance(rf_model)
# Get the top 10 important features
top_10_indices <- order(importance, decreasing = TRUE)[1:10]
top_10_indices
[1] 10 2 1 8 3 28 26 27 25 23
# Get the column names based on the indices
# (the indices refer to columns of x; indexing into stratified_sample, which
# still contains isFraud as its first column, would shift every name by one)
selected_columns <- colnames(x)[top_10_indices]
# Print the selected columns
print(selected_columns)
df_selected <- df_clean
# Preparing the data
x <- df_selected[, -which(names(df_selected) == "isFraud")] # Features (excluding target variable)
y <- df_selected$isFraud # Target variable
# Splitting the data into training and testing sets
set.seed(123) # Set seed for reproducibility
train_indices <- createDataPartition(y, p = 0.9, list = FALSE)
train_x <- x[train_indices, ]
train_y <- y[train_indices]
test_x <- x[-train_indices, ]
test_y <- y[-train_indices]
nb_feat <- ncol(train_x) # Number of features
# Monotone constraint vector: 1 forces predictions to be non-decreasing in the
# first feature of train_x; 0 leaves a feature unconstrained
mono_const <- rep(0, nb_feat)
mono_const[1] <- 1
train_params <- list(
num_leaves = 15, # Max nb leaves in tree
learning_rate = 0.1, # Learning rate
objective = "binary", # Loss function
max_depth = 4, # Max depth of trees
min_data_in_leaf = 50, # Nb points in leaf
bagging_fraction = 0.5, # % of observations
feature_fraction = 0.7, # % of features
nthread = 4, # Parallelization
boosting = "dart", # DART boosting: randomly drops trees during training
drop_rate = 0.1, # Dropout rate for DART
lambda_l1 = 0.3, # L1 penalty on leaf values
seed = 42, # For reproducibility
# early stopping not available with DARTs
#early_stopping_round = 10, # Early stopping after X round if no improvement
monotone_constraints = mono_const,
force_row_wise = T
)
# Quick sanity check of the training split
head(train_x)
head(train_y)
[1] 0 0 0 0 0 0
library(lightgbm)
bst <- lightgbm(
data = train_x |> as.matrix(),
label = train_y, # Target / label
params = train_params, # Passing parameter values
nrounds = 40 # Number of trees in the model
)
[LightGBM] [Info] Number of positive: 5839, number of negative: 289540
[LightGBM] [Info] Total Bins 12305
[LightGBM] [Info] Number of data points in the train set: 295379, number of used features: 149
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.019768 -> initscore=-3.903734
[LightGBM] [Info] Start training from score -3.903734
[1] "[1]: train's binary_logloss:0.0915593"
[1] "[2]: train's binary_logloss:0.0890933"
[1] "[3]: train's binary_logloss:0.0874202"
[1] "[4]: train's binary_logloss:0.0861304"
[1] "[5]: train's binary_logloss:0.0844836"
[1] "[6]: train's binary_logloss:0.083404"
[1] "[7]: train's binary_logloss:0.0835863"
[1] "[8]: train's binary_logloss:0.0826888"
[1] "[9]: train's binary_logloss:0.0819895"
[1] "[10]: train's binary_logloss:0.0823384"
[1] "[11]: train's binary_logloss:0.0816255"
[1] "[12]: train's binary_logloss:0.0818268"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[13]: train's binary_logloss:0.0812364"
[1] "[14]: train's binary_logloss:0.0806198"
[1] "[15]: train's binary_logloss:0.0801187"
[1] "[16]: train's binary_logloss:0.0795637"
[1] "[17]: train's binary_logloss:0.0797841"
[1] "[18]: train's binary_logloss:0.079305"
[1] "[19]: train's binary_logloss:0.0793175"
[1] "[20]: train's binary_logloss:0.0793683"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[21]: train's binary_logloss:0.0796639"
[1] "[22]: train's binary_logloss:0.0863352"
[1] "[23]: train's binary_logloss:0.0851053"
[1] "[24]: train's binary_logloss:0.102389"
[1] "[25]: train's binary_logloss:0.100866"
[1] "[26]: train's binary_logloss:0.0978182"
[1] "[27]: train's binary_logloss:0.0951007"
[1] "[28]: train's binary_logloss:0.0950092"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[29]: train's binary_logloss:0.0947795"
[1] "[30]: train's binary_logloss:0.0924413"
[1] "[31]: train's binary_logloss:0.0904298"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[32]: train's binary_logloss:0.0913073"
[1] "[33]: train's binary_logloss:0.0894771"
[1] "[34]: train's binary_logloss:0.0878257"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[35]: train's binary_logloss:0.086332"
[1] "[36]: train's binary_logloss:0.0850731"
[1] "[37]: train's binary_logloss:0.0852747"
[1] "[38]: train's binary_logloss:0.0959606"
[1] "[39]: train's binary_logloss:0.0934406"
[1] "[40]: train's binary_logloss:0.0912431"
cv_model <- lgb.cv(
  params = train_params,
  data = train_x |> as.matrix(),
  label = train_y,                # Target / label
  eval_freq = 80,
  nrounds = 3,                    # Number of boosting rounds (trees)
  nfold = 5                       # 5-fold cross-validation
)
[LightGBM] [Info] Number of positive: 4643, number of negative: 231660
[LightGBM] [Info] Total Bins 12305
[LightGBM] [Info] Number of data points in the train set: 236303, number of used features: 149
[LightGBM] [Info] Number of positive: 4711, number of negative: 231592
[LightGBM] [Info] Total Bins 12305
[LightGBM] [Info] Number of data points in the train set: 236303, number of used features: 149
[LightGBM] [Info] Number of positive: 4692, number of negative: 231611
[LightGBM] [Info] Total Bins 12305
[LightGBM] [Info] Number of data points in the train set: 236303, number of used features: 149
[LightGBM] [Info] Number of positive: 4619, number of negative: 231685
[LightGBM] [Info] Total Bins 12305
[LightGBM] [Info] Number of data points in the train set: 236304, number of used features: 149
[LightGBM] [Info] Number of positive: 4691, number of negative: 231612
[LightGBM] [Info] Total Bins 12305
[LightGBM] [Info] Number of data points in the train set: 236303, number of used features: 149
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.019649 -> initscore=-3.909910
[LightGBM] [Info] Start training from score -3.909910
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.019936 -> initscore=-3.895077
[LightGBM] [Info] Start training from score -3.895077
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.019856 -> initscore=-3.899200
[LightGBM] [Info] Start training from score -3.899200
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.019547 -> initscore=-3.915200
[LightGBM] [Info] Start training from score -3.915200
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.019852 -> initscore=-3.899418
[LightGBM] [Info] Start training from score -3.899418
[1] "[1]: valid's binary_logloss:0.0918098+0.00202271"
[1] "[3]: valid's binary_logloss:0.0876396+0.0016735"
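The CVBooster returned by `lgb.cv()` keeps the per-round validation loss, so the cross-validation curve and the best round can be pulled out directly. A minimal sketch, assuming the `cv_model` object fitted above and that its records are stored under `record_evals$valid` (as the "valid's binary_logloss" output suggests):

```r
# Mean validation log-loss at each boosting round, averaged over the 5 folds
cv_loss <- unlist(cv_model$record_evals$valid$binary_logloss$eval)
which.min(cv_loss)   # Round with the lowest cross-validated loss
```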
num_leaves <- c(5, 30)                          # Candidate values for num_leaves
learning_rate <- c(0.01, 0.05, 0.2)             # Candidate learning rates
pars <- expand.grid(num_leaves, learning_rate)  # All 6 combinations (Cartesian product)
num_leaves <- pars[, 1]                         # Unroll the grid back into two
learning_rate <- pars[, 2]                      # aligned vectors for pmap()
train_func <- function(num_leaves, learning_rate, train_x){
  train_params <- list(                # First, the list of params
    num_leaves = num_leaves,           # Max number of leaves in a tree
    learning_rate = learning_rate,     # Learning rate
    objective = "binary",              # Loss function
    max_depth = 3,                     # Max depth of trees
    min_data_in_leaf = 50,             # Min number of points in a leaf
    bagging_fraction = 0.5,            # % of observations sampled per tree
    feature_fraction = 0.7,            # % of features sampled per tree
    nthread = 4,                       # Parallelization
    force_row_wise = T
  )
  # Next we train (note: train_y is taken from the enclosing environment)
  bst <- lightgbm(
    data = train_x |> as.matrix(),
    label = train_y,                   # Target / label
    params = train_params,             # Passing parameter values
    eval_freq = 50,
    nrounds = 10                       # Number of trees in the model
  )
  # Finally, record the training loss after the last (10th) round
  return(bst$record_evals$train$binary_logloss$eval[[10]])
}
train_func(10, 0.1, train_x) # Testing
[LightGBM] [Info] Number of positive: 5839, number of negative: 289540
[LightGBM] [Info] Total Bins 12303
[LightGBM] [Info] Number of data points in the train set: 295379, number of used features: 149
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.019768 -> initscore=-3.903734
[LightGBM] [Info] Start training from score -3.903734
[1] "[1]: train's binary_logloss:0.0925521"
[1] "[10]: train's binary_logloss:0.083271"
[1] 0.08327097
# install.packages("purrr") # Install purrr if not already installed
library(purrr) # Load the purrr package
grd <- pmap(list(num_leaves, learning_rate), # Parameters for the grid search
train_func, # Function on which to apply the grid search
train_x = train_x # Non-changing argument (data is fixed)
)
[LightGBM] [Info] Number of positive: 5839, number of negative: 289540
[LightGBM] [Info] Total Bins 12303
[LightGBM] [Info] Number of data points in the train set: 295379, number of used features: 149
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.019768 -> initscore=-3.903734
[LightGBM] [Info] Start training from score -3.903734
[1] "[1]: train's binary_logloss:0.0966762"
[1] "[10]: train's binary_logloss:0.0940843"
[LightGBM] [Info] Number of positive: 5839, number of negative: 289540
[LightGBM] [Info] Total Bins 12303
[LightGBM] [Info] Number of data points in the train set: 295379, number of used features: 149
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.019768 -> initscore=-3.903734
[LightGBM] [Info] Start training from score -3.903734
[1] "[1]: train's binary_logloss:0.0966141"
[1] "[10]: train's binary_logloss:0.0935996"
[LightGBM] [Info] Number of positive: 5839, number of negative: 289540
[LightGBM] [Info] Total Bins 12303
[LightGBM] [Info] Number of data points in the train set: 295379, number of used features: 149
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.019768 -> initscore=-3.903734
[LightGBM] [Info] Start training from score -3.903734
[1] "[1]: train's binary_logloss:0.0949216"
[1] "[10]: train's binary_logloss:0.0880179"
[LightGBM] [Info] Number of positive: 5839, number of negative: 289540
[LightGBM] [Info] Total Bins 12303
[LightGBM] [Info] Number of data points in the train set: 295379, number of used features: 149
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.019768 -> initscore=-3.903734
[LightGBM] [Info] Start training from score -3.903734
[1] "[1]: train's binary_logloss:0.0946179"
[1] "[10]: train's binary_logloss:0.0870028"
[LightGBM] [Info] Number of positive: 5839, number of negative: 289540
[LightGBM] [Info] Total Bins 12303
[LightGBM] [Info] Number of data points in the train set: 295379, number of used features: 149
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.019768 -> initscore=-3.903734
[LightGBM] [Info] Start training from score -3.903734
[1] "[1]: train's binary_logloss:0.0914336"
[1] "[10]: train's binary_logloss:0.0816907"
[LightGBM] [Info] Number of positive: 5839, number of negative: 289540
[LightGBM] [Info] Total Bins 12303
[LightGBM] [Info] Number of data points in the train set: 295379, number of used features: 149
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.019768 -> initscore=-3.903734
[LightGBM] [Info] Start training from score -3.903734
[1] "[1]: train's binary_logloss:0.0903837"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[10]: train's binary_logloss:0.0802696"
grd <- bind_cols(pars, tibble(loss = grd))
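To see which hyperparameter pair wins, the list column produced by `pmap()` can be flattened and the grid sorted by loss. A sketch, assuming the `grd` tibble built above and the default unnamed `expand.grid()` column names `Var1`/`Var2`:

```r
library(dplyr)

grd |>
  mutate(loss = unlist(loss)) |>                    # Flatten the list column to numeric
  rename(num_leaves = Var1, learning_rate = Var2) |>
  arrange(loss)                                     # Best (lowest-loss) combination first
```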
lgb.importance(bst) |>
top_n(20, Gain) |>
ggplot(aes(x = Gain, y = reorder(Feature, Gain))) + geom_col(fill = "#22AABB", alpha = 0.7) +
theme_bw() + theme(axis.title.y = element_blank())
It looks like columns V317 and V308 are the most important features in the LightGBM model.
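To confirm this numerically, the raw importance table can be inspected as well; `lgb.importance()` returns a data.table with per-feature Gain, Cover, and Frequency. A sketch, assuming the `bst` model trained above:

```r
imp <- lgb.importance(bst)   # One row per feature used by the model
head(imp, 10)                # Columns: Feature, Gain, Cover, Frequency
```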
LGB_interpret <- lgb.interprete(bst, test_x |> data.matrix(), 1:2)
LGB_interpret
[[1]]
[[2]]
NA
lgb.plot.interpretation(
  tree_interpretation_dt = LGB_interpret[[1L]],
  top_n = 20
)
https://www.kaggle.com/c/ieee-fraud-detection
https://www.kaggle.com/code/artgor/eda-and-models
https://www.kaggle.com/code/kabure/extensive-eda-and-modeling-xgb-hyperopt
https://www.kaggle.com/code/shahules/tackling-class-imbalance
https://www.kaggle.com/code/cdeotte/xgb-fraud-with-magic-0-9600
https://www.kaggle.com/code/jesucristo/fraud-complete-eda
https://www.kaggle.com/code/robikscube/ieee-fraud-detection-first-look-and-eda
https://www.kaggle.com/code/alijs1/ieee-transaction-columns-reference/notebook
Note: Used ChatGPT to debug errors in the code.